Better Language Models with Model Merging

Abstract

This paper investigates model merging, a tech- 
nique for deriving Markov models from text or 
speech corpora. Models are derived by starting 
with a large and specific model and by succes- 
sively combining states to build smaller and more 
general models. We present methods to reduce 
the time complexity of the algorithm and report 
on experiments on deriving language models for 
a speech recognition task. The experiments show 
the advantage of model merging over the standard 
bigram approach. The merged model assigns a 
lower perplexity to the test set and uses consider- 
ably fewer states. 

Introduction 

Hidden Markov Models are commonly used for
statistical language models, e.g., in part-of-speech
tagging and speech recognition (Rabiner, 1989).
The models need a large set of parameters which 
are induced from a (text-) corpus. The parameters 
should be optimal in the sense that the resulting 
models assign high probabilities to seen training 
data as well as new data that arises in an applica- 
tion. 

There are several methods to estimate model 
parameters. The first one is to use each word 
(type) as a state and estimate the transition prob- 
abilities between two or three words by using the 
relative frequencies of a corpus. This method is 
commonly used in speech recognition and known 
as word-bigram or word-trigram model. The rel- 
ative frequencies have to be smoothed to handle 
the sparse data problem and to avoid zero proba- 
bilities. 

The second method is a variation of the first method. Words are
automatically grouped, e.g., by similarity of distribution in the corpus
(Pereira et al., 1993). The relative frequencies of pairs or triples of
groups (categories, clusters) are used as model parameters, and each group
is represented by a state in the model. The second method has the advantage
of drastically reducing the number of model parameters and thereby reducing
the sparse data problem; there is more data per group than per word, thus
estimates are more precise.

The third method uses manually defined categories. They are linguistically
motivated and usually called parts-of-speech. An important difference to
the second method with automatically derived categories is that with the
manual definition a word can belong to more than one category. A corpus is
(manually) tagged with the categories and transition probabilities between
two or three categories are estimated from their relative frequencies. This
method is commonly used for part-of-speech tagging (Church, 1988).

The fourth method is a variation of the third 
method and is also used for part-of-speech tagging. 
This method does not need a pre-annotated corpus 
for parameter estimation. Instead it uses a lexicon 
stating the possible parts-of-speech for each word, 
a raw text corpus, and an initial bias for the tran- 
sition and output probabilities. The parameters 
are estimated by using the Baum-Welch algorithm
(Baum et al., 1970). The accuracy of the derived
model depends heavily on the initial bias, but with
a good choice results are comparable to those of
method three (Cutting et al., 1992).

This paper investigates a fifth method for es- 
timating natural language models, combining the 
advantages of the methods mentioned above. It 
is suitable for both speech recognition and part- 
of-speech tagging, has the advantage of automat- 
ically deriving word categories from a corpus and 
is capable of recognizing the fact that a word be- 
longs to more than one category. Unlike other 
techniques it not only induces transition and out- 
put probabilities, but also the model topology, i.e., 
the number of states, and for each state the out- 
puts that have a non-zero probability. The method 
is called model merging and was introduced by
(Omohundro, 1992).

The rest of the paper is structured as follows. We first give a short
introduction to Markov models and present the model merging technique.
Then, techniques for reducing the time complexity are presented and we
report two experiments using these techniques.

Markov Models 

A discrete output, first order Markov Model con- 
sists of 

• a finite set of states $Q \cup \{q_s, q_e\}$, $q_s, q_e \notin Q$, with
$q_s$ the start state, and $q_e$ the end state;

• a finite output alphabet $\Sigma$;

• a $(|Q| + 1) \times (|Q| + 1)$ matrix, specifying the probabilities of
state transitions $p(q'|q)$ between states $q$ and $q'$ (there are no
transitions into $q_s$, and no transitions originating in $q_e$); for each
state $q \in Q \cup \{q_s\}$, the sum of the outgoing transition
probabilities is 1, $\sum_{q' \in Q \cup \{q_e\}} p(q'|q) = 1$;

• a $|Q| \times |\Sigma|$ matrix, specifying the output probabilities
$p(\sigma|q)$ of state $q$ emitting output $\sigma$; for each state
$q \in Q$, the sum of the output probabilities is 1,
$\sum_{\sigma \in \Sigma} p(\sigma|q) = 1$.

A Markov model starts running in the start state $q_s$, makes a transition
at each time step, and stops when reaching the end state $q_e$. The
transition from one state to another is done according
to the probabilities specified with the transitions. 
Each time a state is entered (except the start and 
end state) one of the outputs is chosen (again ac- 
cording to their probabilities) and emitted. 
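The definition above can be illustrated with a minimal Python sketch. The dictionary layout, the marker strings `<s>` and `</s>` for $q_s$ and $q_e$, and all probabilities are assumptions chosen for illustration; they are not taken from the experiments in this paper. The toy model generates strings of the form $(a(b|c))^+$.

```python
import random

START, END = "<s>", "</s>"  # hypothetical markers for q_s and q_e

# transitions[q][q2] = p(q2|q); outputs[q][o] = p(o|q). All values are made up.
transitions = {
    START: {"A": 1.0},
    "A": {"B": 0.5, "C": 0.5},
    "B": {"A": 0.5, END: 0.5},
    "C": {"A": 0.5, END: 0.5},
}
outputs = {"A": {"a": 1.0}, "B": {"b": 1.0}, "C": {"c": 1.0}}


def check_normalized(transitions, outputs, tol=1e-9):
    """Every transition row and every output row must sum to 1."""
    for dist in list(transitions.values()) + list(outputs.values()):
        assert abs(sum(dist.values()) - 1.0) < tol


def sample(transitions, outputs, rng):
    """Run the model from the start state until the end state is reached."""
    state, emitted = START, []
    while True:
        successors = transitions[state]
        state = rng.choices(list(successors), weights=list(successors.values()))[0]
        if state == END:
            return emitted
        symbols = outputs[state]
        emitted.append(rng.choices(list(symbols), weights=list(symbols.values()))[0])


check_normalized(transitions, outputs)
print("".join(sample(transitions, outputs, random.Random(0))))
```

Nested dictionaries instead of dense matrices keep zero-probability transitions and outputs implicit, which matches the sparse topologies discussed in the rest of the paper.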

Assigning Probabilities to Data 

For the rest of the paper, we are interested in the 
probabilities which are assigned to sequences of 
outputs by the Markov models. These can be cal- 
culated in the following way. 

Given a model $M$, a sequence of outputs $\sigma = \sigma_1 \ldots \sigma_k$
and a sequence of states $Q = q_1 \ldots q_k$ (of the same length), the
probability that the model runs through the sequence of states and emits
the given outputs is

$$P_M(Q, \sigma) = \left( \prod_{i=1}^{k} p_M(q_i|q_{i-1}) \, p_M(\sigma_i|q_i) \right) p_M(q_e|q_k)$$

(with $q_0 = q_s$). A sequence of outputs can be emitted by more than one
sequence of states, thus we have to sum over all sequences of states with
the given length to get the probability that a model emits a given sequence
of outputs:

$$P_M(\sigma) = \sum_{Q} P_M(Q, \sigma).$$


The probabilities are calculated very efficiently
with the Viterbi algorithm (Viterbi, 1967). Its
time complexity is linear in the sequence length
despite the exponential growth of the search space.
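As a hedged illustration of this dynamic programming idea, the following sketch sums over all state sequences with the forward recursion, which has the same structure as the Viterbi recursion but uses a sum where Viterbi uses a maximum. The function name, the dictionary layout and the two-state toy model are assumptions made for this example.

```python
def sequence_probability(transitions, outputs, seq, start="<s>", end="</s>"):
    """Sum of P(Q, seq) over all state sequences Q, computed left to right."""
    # alpha[q] = probability of emitting the prefix read so far and being in q
    alpha = {start: 1.0}
    for symbol in seq:
        new_alpha = {}
        for q, a in alpha.items():
            for q_next, p_trans in transitions.get(q, {}).items():
                p_out = outputs.get(q_next, {}).get(symbol, 0.0)
                if p_out > 0.0:
                    new_alpha[q_next] = new_alpha.get(q_next, 0.0) + a * p_trans * p_out
        alpha = new_alpha
    # finish with the transition into the end state
    return sum(a * transitions.get(q, {}).get(end, 0.0) for q, a in alpha.items())


# Toy model (made-up numbers): the output "a" can be produced by two paths.
transitions = {"<s>": {"X": 0.5, "Y": 0.5}, "X": {"</s>": 1.0}, "Y": {"</s>": 1.0}}
outputs = {"X": {"a": 1.0}, "Y": {"a": 0.5, "b": 0.5}}

p_a = sequence_probability(transitions, outputs, ["a"])
print(p_a)  # 0.5 * 1.0 + 0.5 * 0.5 = 0.75
```

Each output symbol is processed once, and the work per symbol depends only on the number of transitions in the model, so the running time grows linearly with the sequence length.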

Perplexity 

Markov models assign rapidly decreasing proba- 
bilities to output sequences of increasing length. 
To compensate for different lengths and to make 
their probabilities comparable, one uses the per- 
plexity PP of an output sequence instead of its 
probability. The perplexity is defined as 

$$PP_M(\sigma) = \frac{1}{\sqrt[k]{P_M(\sigma)}}$$

The probability is normalized by taking the $k$-th
root ($k$ is the length of the sequence). Similarly,
the log perplexity $LP$ is defined:



$$LP_M(\sigma) = \log PP_M(\sigma) = \frac{-\log P_M(\sigma)}{k}$$

with $P_M(\sigma) = \sum_{Q} P_M(Q, \sigma)$.



Here, the log probability is normalized by dividing 
by the length of the sequence. 

PP and LP are defined such that higher per- 
plexities (log perplexities, resp.) correspond to 
lower probabilities, and vice versa. These mea- 
sures are used to determine the quality of Markov 
models. The lower the perplexity (and log perplex- 
ity) of a test sequence, the higher its probability, 
and thus the better it is predicted by the model. 
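The two measures can be computed as in the following small sketch. The base of the logarithm is not stated in the text; the natural logarithm is an assumption here, and the function names are chosen for this example only.

```python
import math


def perplexity(prob, length):
    """PP = 1 / (k-th root of the sequence probability)."""
    return prob ** (-1.0 / length)


def log_perplexity(prob, length):
    """LP = -log(P) / k, using the natural logarithm (an assumption)."""
    return -math.log(prob) / length


# A sequence of length 2 with probability 1/16 has perplexity 4.
print(perplexity(1 / 16, 2))
print(log_perplexity(1 / 16, 2))  # equals log(4)
```

For long sequences the probability underflows in floating point, so in practice one would accumulate log probabilities and compute the log perplexity directly.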

Model Merging 

Model merging is a technique for inducing model 
parameters for Markov models from a text corpus.
It was introduced in (Omohundro, 1992)
and (Stolcke and Omohundro, 1994) to induce
models for regular languages from a few samples,
and adapted to natural language models in
(Brants, 1995). Unlike other techniques it not
only induces transition and output probabilities 
from the corpus, but also the model topology, i.e., 
the number of states and for each state the out- 
puts that have non-zero probability. In n-gram 
approaches the topology is fixed. E.g., in a POS n-gram model, the states
are mostly syntactically motivated: each state represents a syntactic
category and only words belonging to the same category have a non-zero
output probability in a particular state. However, n-gram models make
the implicit assumption that all words belonging to the same category have
a similar distribution in a corpus. This is not true in most cases.

By estimating the topology, model merging groups words into categories,
since all words that can be emitted by the same state form a category. The
advantage of model merging in this respect is that it can recognize that a
word (the type) belongs to more than one category, while each occurrence
(the token) is assigned a unique category. This naturally reflects manual
syntactic categorizations, where a word can belong to several syntactic
classes but each occurrence of a word is unambiguous.

Figure 1: Model merging for a corpus $S = \{ab, ac, abac\}$, starting with
the trivial model in a) and ending with the generalization $(a(b|c))^+$ in
e). Several steps of merging between model b) and c) are not shown.
Unmarked transitions and outputs have probability 1.

The Algorithm 

Model merging induces Markov models in the following way. Merging starts
with an initial, very specific model. For this purpose, the maximum
likelihood Markov model is chosen, i.e., a model that exactly matches the
corpus. There is one path for each utterance in the corpus and each path is
used by one utterance only. Each path gets the same probability $1/u$, with
$u$ the number of utterances in the corpus. This model is also referred to
as the trivial model. Figure 1.a shows the trivial model for a corpus with
words a, b, c and utterances ab, ac, abac. It has one path for each of the
three utterances ab, ac, and abac, and each path gets the same probability
1/3. The trivial model assigns a probability of $p(S|M_a) = 1/27$ to the
corpus. Since the model makes an implicit independence assumption between
the utterances, the corpus probability is calculated by multiplying the
utterances' probabilities, yielding $1/3 \cdot 1/3 \cdot 1/3 = 1/27$.
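The construction of the trivial model can be sketched as follows. This is an illustrative reading of the description above, assuming non-empty, pairwise different utterances; the function names and the integer state numbering are hypothetical.

```python
from fractions import Fraction

START, END = "<s>", "</s>"  # hypothetical markers for the start and end state


def trivial_model(corpus):
    """One state per token; every utterance gets its own path with probability 1/u."""
    u = len(corpus)
    transitions = {START: {}}
    outputs = {}
    paths = []
    next_state = 1
    for utterance in corpus:
        path, prev = [], START
        for token in utterance:
            state = next_state
            next_state += 1
            # only the first transition of a path branches, with probability 1/u
            p = Fraction(1, u) if prev == START else Fraction(1)
            transitions.setdefault(prev, {})[state] = p
            outputs[state] = {token: Fraction(1)}
            path.append(state)
            prev = state
        transitions.setdefault(prev, {})[END] = Fraction(1)
        paths.append(path)
    return transitions, outputs, paths


def path_probability(transitions, outputs, path, utterance):
    """Probability of emitting the utterance along its own path."""
    p, prev = Fraction(1), START
    for state, token in zip(path, utterance):
        p *= transitions[prev][state] * outputs[state][token]
        prev = state
    return p * transitions[prev][END]


corpus = ["ab", "ac", "abac"]
transitions, outputs, paths = trivial_model(corpus)
corpus_probability = Fraction(1)
for path, utterance in zip(paths, corpus):
    corpus_probability *= path_probability(transitions, outputs, path, utterance)
print(corpus_probability)  # 1/27
```

The model has one proper state per token (eight for this corpus) plus the start and end state, which is why its size grows linearly with the corpus.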

Now states are merged successively, except for 
the start and end state. Two states are selected 
and removed and a new merged state is added. 
The transitions from and to the old states are redi- 
rected to the new state, the transition probabilities 
are adjusted to maximize the likelihood of the cor- 
pus; the outputs are joined and their probabilities 
are also adjusted to maximize the likelihood. One 
step of merging can be seen in Figure 1.b. States 1
and 3 are removed, a combined state 1,3 is added, 
and the probabilities are adjusted. 

The criterion for selecting states to merge is 
the probability of the Markov model generating 
the corpus. We want this probability to stay as 
high as possible. Of all possible merges (generally, there are $k(k-1)/2$
possible merges, with $k$ the number of states excluding the start and end
states, which are not allowed to merge) we take the merge that results in
the minimal change of the probability. For the trivial model and $u$
pairwise different utterances the probability is $p(S|M_{triv}) = 1/u^u$.
The probability either stays constant, as in Figure 1.b and c, or
decreases, as in Figure 1.d and e. The probability never increases because
the trivial model is the maximum likelihood model, i.e., it maximizes the
probability of the corpus given the model.
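One greedy merging step can be sketched as below. The sketch makes a simplifying assumption that is not part of the description above: every utterance keeps a single fixed state path, so the maximum likelihood parameters after a merge are simply relative frequencies of the counts along these paths. The data layout and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

START, END = "<s>", "</s>"


def log_likelihood(paths):
    """Log probability of the corpus when each utterance follows its fixed path
    and all probabilities are relative frequencies of the path counts."""
    trans, emit = Counter(), Counter()
    for path in paths:
        states = [START] + [q for q, _ in path] + [END]
        trans.update(zip(states, states[1:]))
        emit.update(path)
    leaving, visits = Counter(), Counter()
    for (q, _), c in trans.items():
        leaving[q] += c
    for (q, _), c in emit.items():
        visits[q] += c
    ll = sum(c * math.log(c / leaving[q]) for (q, _), c in trans.items())
    ll += sum(c * math.log(c / visits[q]) for (q, _), c in emit.items())
    return ll


def merge(paths, keep, drop):
    """Redirect every occurrence of state `drop` to state `keep`."""
    return [[(keep if q == drop else q, o) for q, o in path] for path in paths]


def best_merge(paths):
    """Test all k(k-1)/2 merges and return the one that keeps the likelihood highest."""
    states = sorted({q for path in paths for q, _ in path})
    return max((log_likelihood(merge(paths, a, b)), a, b)
               for a, b in combinations(states, 2))


# Trivial model for the corpus {ab, ac, abac}: one numbered state per token.
paths = [[(1, "a"), (2, "b")],
         [(3, "a"), (4, "c")],
         [(5, "a"), (6, "b"), (7, "a"), (8, "c")]]
print(math.exp(log_likelihood(paths)))             # approximately 1/27
ll, a, b = best_merge(paths)
print(a, b, math.exp(ll))                          # a merge that leaves 1/27 unchanged
```

Merging states 1 and 3, for example, leaves the corpus probability at 1/27, whereas merging an a-state with a b-state lowers it, in line with the behavior described for Figure 1.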

Model merging stops when a predefined threshold for the corpus probability
is reached. Some statistically motivated criteria for termination using
model priors are discussed in (Stolcke and Omohundro, 1994).



Using Model Merging 

The model merging algorithm needs several optimizations to be applicable to
large natural language corpora; otherwise the amount of time needed for
deriving the models is too large. Generally, there are $O(l^2)$
hypothetical merges to be tested for each merging step ($l$ is the length
of the training corpus). The probability of the training corpus has to be
calculated for each hypothetical merge, which is $O(l)$ with dynamic
programming. Thus, each step of merging is $O(l^3)$. If we want to reduce
the model from size $l + 2$ (the trivial model, which consists of one state
for each token plus initial and final states) to some fixed size, we need
$O(l)$ steps of merging. Therefore, deriving a Markov model by model
merging is $O(l^4)$ in time.

Stolcke and Omohundro (1994) discuss several
computational shortcuts and approximations:

1. Immediate merging of identical initial and final 
states of different utterances. These merges do 
not change the corpus probability and thus are 
the first merges anyway. 

2. Usage of the Viterbi path (best path) only in- 
stead of summing up all paths to determine the 
corpus probability. 

3. The assumption that all input samples retain 
their Viterbi path after merging. Making this 
approximation, it is no longer necessary to re- 
parse the whole corpus for each hypothetical 
merge. 

We use two additional strategies to reduce the 
time complexity of the algorithm: a series of cas- 
caded constraints on the merges and the variation 
of the starting point. 

Constraints 

When applying model merging one can observe 
that first mainly states with the same output are 
merged. After several steps of merging, it is no 
longer the same output but still mainly states that 
output words of the same syntactic category are 
merged. This behavior can be exploited by intro- 
ducing constraints on the merging process. The 
constraints allow only some of the otherwise pos- 
sible merges. Only the allowed merges are tested 
for each step of merging. 

We consider constraints that divide the states 
of the current model into equivalence classes. Only 
states belonging to the same class are allowed to 
merge. E.g., we can divide the states into classes 
generating the same outputs. If the current model 
has $N$ states and we divide them into $k > 1$ nonempty equivalence classes
$C_1 \ldots C_k$, then, instead of $N(N-1)/2$, we have to test

$$\sum_{i=1}^{k} \frac{|C_i|(|C_i| - 1)}{2} < \frac{N(N-1)}{2}$$

merges only.

The best case for a model of size N is the 
division into N/2 classes of size 2. Then, only N/2 
merges must be tested to find the best merge. 

The best division into $k > 1$ classes for some model of size $N$ is the
creation of classes that all have the same size $N/k$ (or an approximation
if $N/k \notin \mathbb{N}$). Then,

$$\frac{\frac{N}{k}\left(\frac{N}{k} - 1\right)}{2} \cdot k = \frac{N\left(\frac{N}{k} - 1\right)}{2}$$

merges must be tested for each step of merging.

Thus, the introduction of these constraints 
does not reduce the order of the time complexity, 
but it can reduce the constant factor significantly 
(see section about experiments). 
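The counting argument can be checked with a short sketch. The mapping from states to classes and the function names are assumptions for illustration; the example uses the eight proper states of the trivial model for the corpus {ab, ac, abac} under the same-output division.

```python
from collections import Counter


def unconstrained_merges(n):
    """Number of state pairs among n states: n(n-1)/2."""
    return n * (n - 1) // 2


def constrained_merges(class_of_state):
    """Sum of |C_i|(|C_i|-1)/2 over the equivalence classes."""
    sizes = Counter(class_of_state.values())
    return sum(c * (c - 1) // 2 for c in sizes.values())


# Each state is assigned to the class of the single word it emits.
class_of_state = {1: "a", 2: "b", 3: "a", 4: "c", 5: "a", 6: "b", 7: "a", 8: "c"}
print(unconstrained_merges(len(class_of_state)))  # 28
print(constrained_merges(class_of_state))         # 6 + 1 + 1 = 8

# Equal-sized classes: N = 8 states in k = 4 classes give N(N/k - 1)/2 = 4.
print(constrained_merges({i: i // 2 for i in range(8)}))  # 4
```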

The following equivalence classes can be used 
for constraints when using untagged corpora: 

1. States that generate the same outputs (unigram 
constraint) 

2. unigram constraint, and additionally all prede- 
cessor states must generate the same outputs 
(bigram constraint) 

3. trigrams or higher, if the corpora are large 
enough 

4. a variation of the first constraint: states that output words
belonging to one ambiguity class, i.e., words that can be of a
certain number of syntactic classes.

Merging starts with one of the constraints. Af- 
ter a number of merges have been performed, the 
constraint is discarded and a weaker one is used 
instead. 

The standard n-gram approaches are special 
cases of using model merging and constraints. 
E.g., if we use the unigram constraint, and merge 
states until no further merge is possible under this 
constraint, the resulting model is a standard bi- 
gram model, regardless of the order in which the 
merges were performed. 
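The end point of merging under the unigram constraint can be sketched directly: once all states with the same output are merged, there is one state per word type, and the maximum likelihood transition probabilities are the relative bigram frequencies. The sketch below is unsmoothed, which is a simplification, and the function name and boundary markers are hypothetical.

```python
from collections import Counter, defaultdict


def bigram_model(corpus, start="<s>", end="</s>"):
    """Transition probabilities between word states as relative bigram frequencies."""
    counts = defaultdict(Counter)
    for utterance in corpus:
        words = [start] + list(utterance) + [end]
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return {prev: {nxt: c / sum(ctr.values()) for nxt, c in ctr.items()}
            for prev, ctr in counts.items()}


model = bigram_model(["ab", "ac", "abac"])
print(model["a"])  # {'b': 0.5, 'c': 0.5}
print(model["b"])  # {'</s>': 0.5, 'a': 0.5}
```

Because the result depends only on the bigram counts of the corpus, it is the same for every order in which the same-output states are merged.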

In practice, a constraint will be discarded be- 
fore no further merge is possible (otherwise the 
model could have been derived directly, e.g., by 
the standard n-gram technique). Yet, the ques- 
tion when to discard a constraint to achieve best 
results is unsolved. 



The Starting Point 

The initial model of the original model merging procedure is the maximum
likelihood or trivial model. This model has the advantage of directly
representing the corpus. But its disadvantage is its huge number of states.
A lot of computation time can be saved by choosing an initial model with
fewer states.

The initial model must have two properties: 

1. it must be larger than the intended model, and 

2. it must be easy to construct. 

The trivial model has both properties. A class of 
models that can serve as the initial model as well 
are n-gram models. These models are smaller by 
one or more orders of magnitude than the trivial 
model and therefore could speed up the derivation 
of a model significantly. 

This choice of a starting point excludes a lot 
of solutions which are allowed when starting with 
the maximum likelihood model. Therefore, start- 
ing with an n-gram model yields a model that is 
at most equivalent to one that is generated when 
starting with the trivial model, and that can be 
much worse. But it should still be better than
any n-gram model that is of lower or equal order
than the initial model.

Experiments 

Model Merging vs. Bigrams 

The first experiment compares model merging 
with a standard bigram model. Both are trained 
on the same data. We use $N_{train} = 14{,}421$
words of the Verbmobil corpus. The corpus
consists of transliterated dialogues on business
appointments¹. The models are tested on $N_{test} = 2{,}436$
words of the same corpus. Training and test
parts are disjoint.

The bigram model yields a Markov model with 
1,440 states. It assigns a log perplexity of 1.20 to 
the training part and 2.40 to the test part. 

Model merging starts with the maximum like- 
lihood model for the training part. It has 14,423 
states, which correspond to the 14,421 words (plus 
an initial and a final state). The initial log per- 
plexity of the training part is 0.12. This low value 
shows that the initial model is very specialized in 
the training part. 

We start merging with the same-output (unigram) constraint to reduce
computation time. After 12,500 merges the constraint is discarded and from
then on all remaining states are allowed to merge. The constraints and the
point of changing the constraint are chosen for pragmatic reasons. We want
the constraints to be as weak as possible to allow the maximal number of
solutions, but at the same time the number of merges must be manageable by
the system used for computation (a SparcServer 1000 with 250 MB main
memory). As the following experiment will show, the exact points of
introducing/discarding constraints are not important for the resulting
model.

¹ Many thanks to the Verbmobil project for providing these data. We use
dialogues that were recorded in 1993 and 94, and which are now available
from the Bavarian Archive for Speech Signals BAS
(http://www.phonetik.uni-muenchen.de/Bas/BasHomeeng.html).

Figure 2: Log Perplexity of the training part during merging. Constraints:
same output until 12,500 / none after 12,500. The thin lines show the
further development if we retain the same-output constraint until no
further merge is possible. The length of the training part is
$N_{train} = 14{,}421$.

Figure 3: Log Perplexity of the test part during merging. Constraints: same
output until 12,500 / none after 12,500. The thin line shows the further
development if we retain the same-output constraint, finally yielding a
bigram model. The length of the test part is $N_{test} = 2{,}436$.

There are $N_{train}(N_{train} - 1)/2 \approx 10^8$ hypothetical
first merges in the unconstrained case. This
number is reduced to $\approx 7 \cdot 10^5$ when using the
unigram constraint, thus by a factor of $\approx 150$.
By using the constraint we need about a week of 
computation time on a SparcServer 1000 for the 
whole merging process. Computation would not 
have been feasible without this reduction. 
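The orders of magnitude quoted above can be reproduced with a few lines of arithmetic. The constrained count of about $7 \cdot 10^5$ is taken from the text as given, since it depends on the word distribution of the corpus and cannot be recomputed here.

```python
n_train = 14_421

# number of state pairs in the trivial model (start and end state excluded)
unconstrained = n_train * (n_train - 1) // 2
print(unconstrained)  # 103975410, i.e. about 10^8

constrained = 7e5  # approximate value reported in the text
reduction = unconstrained / constrained
print(round(reduction))  # close to the reported factor of about 150
```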

Figure 2 shows the increase in perplexity during merging. There is no
change during the first 1,454 merges. Here, only identical sequences of
initial and final states are merged (compare Figure 1.a to c). These merges
do not influence the probability assigned to the training part and thus do
not change the perplexity.

Then, perplexity slowly increases. It can never 
decrease: the maximum likelihood model assigns 
the highest probability to the training part and 
thus the lowest perplexity. 

Figure 2 also shows the perplexity's slope. It is
low until about 12,000 merges, then drastically in- 
creases. At about this point, after 12,500 merges, 
we discard the constraint. For this reason, the 
curve is discontinuous at 12,500 merges. The effect 
of further retaining the constraint is shown by the 
thin lines. These stop after 12,983 merges, when 
all states with the same outputs are merged (i.e., 
when a bigram model is reached). Merging with- 
out a constraint continues until only three states 
remain: the initial and the final state plus one 
proper state. 

Note that the perplexity changes very slowly 
for the largest part, and then changes drastically 
during the last merges. There is a constant phase
between 0 and 1,454 merges. Between 1,454 and
≈11,000 merges the log perplexity increases roughly
linearly with the number of merges, and it
explodes afterwards.

What happens to the test part? Model merg- 
ing starts with a very special model which then is 
generalized. Therefore, the perplexity of some ran- 
dom sample of dialogue data (what the test part is 
supposed to be) should decrease during merging. 
This is exactly what we find in the experiment. 

Figure 3 shows the log perplexity of the test part during merging. Again,
we find the discontinuity at the point where the constraint is changed.
And again, we find very little change in perplexity during about 12,000
initial merges, and large changes during the last merges.

Table 1: Number of states and Log Perplexity for the derived models and an
additional, previously unseen test part, consisting of 9,784 words.
(a) standard bigram model, (b) constrained model merging (first
experiment), (c) model merging starting with a bigram model (second
experiment).

            (a)        (b)              (c)
type        bigrams    model merging    MM start with bigrams
# states    1,440      113              113
Log PP      2.78       2.41             2.39

Model merging finds a model with 113 states, 
which assigns a log perplexity of 2.26 to the test 
part. Thus, in addition to finding a model with 
lower log perplexity than the bigram model (2.26 
vs. 2.40), we find a model that at the same time 
has less than 1/10 of the states (113 vs. 1,440). 

To test if we found a model that predicts new 
data better than the bigram model and to be sure 
that we did not find a model that is simply very 
specialized to the test part, we use a new, previ- 
ously unseen part of the Verbmobil corpus. This 
part consists of 9,784 words. The bigram model 
assigns a log perplexity of 2.78, the merged model 
with 113 states assigns a log perplexity of 2.41 (see
Table 1). Thus, the model found by model merging
can be regarded as generally better than the bigram
model.

Improvements 

The derivation of the optimal model took about 
a week although the size of the training part was 
relatively small. Standard speech applications do 
not use 14,000 words for training as we do in this 
experiment, but 100,000, 200,000 or more. It is 
not possible to start with a model of 100,000 states 
and to successively merge them, at least it is not 
possible on today's machines. Each step would 
require the test of $\approx 10^9$ merges.

In the previous experiment, we abandoned 
the same-output constraint after 12,500 merges to 
keep the influence on the final result as small as 
possible. It can not be skipped from the begin- 
ning because somehow the time complexity has to 
be reduced. But it can be further retained, until 
no further merge under this constraint is possible. 
This yields a bigram model. The second experi- 
ment uses the bigram model with 1,440 states as 
its starting point and imposes no constraints on
the merges. The results are shown in Figure 4.

Figure 4: Log Perplexity of training and test parts when starting with a
bigram model. The starting point is indicated with o, the curves of the
previous experiment are shown in thin lines.

We see that the perplexity curves very quickly approach their counterparts
from the previous experiment. The states differ from those of the
previously found model, but there is no difference in the number of states
and corpus perplexity at the optimal point. So one could in fact, at least
in the shown case, start with the bigram model without losing anything.
Finally, we calculate the perplexity for the additional test part. It is
2.39, thus again lower than the perplexity of the bigram model (see
Table 1). It is even slightly lower than in the previous experiment, but
this is most probably due to random variation.

The derived models are not equivalent in every case
(with respect to perplexity) when we start with the
trivial model and when we start with the bigram
model. We ascribe the equivalence in the
experiment to the particular size of the training 
corpus. For a larger training corpus, the optimal 
model should be closer in size to the bigram model, 
or even larger than a bigram model. In such a case 
starting with bigrams does not lead to an optimal 
model, and a trigram model must be used. 

Conclusion 

We investigated model merging, a technique to in- 
duce Markov models from corpora. The original 
procedure is improved by introducing constraints 
and a different initial model. The procedures are 
shown to be applicable to a transliterated speech 
corpus. The derived models assign lower perplex- 
ities to test data than the standard bigram model 
derived from the same training corpus. Additionally,
the merged model was much smaller than the
bigram model.

The experiments revealed a feature of model 
merging that allows for improvement of the 
method's time complexity. There is a large ini- 
tial part of merges that do not change the model's 
perplexity w.r.t. the test part, and that do not in- 
fluence the final optimal model. The time needed 
to derive a model is drastically reduced by abbreviating
these initial merges. Instead of starting with
the trivial model, one can start with a smaller,
easy-to-produce model, but one has to ensure that
its size is still larger than the optimal model.